XML-based Stand-off Representation and Exploitation of Multi-Level Linguistic Annotation
Abstract
This paper deals with the representation of multi-level linguistic annotations. It proposes an XML-based, generic stand-off architecture and presents an example instantiation. Application scenarios that profit from this architecture are sketched out.

In recent years, corpus linguistics has become increasingly important to a broad community, including people working in theoretical, applied and computational linguistics. To many of them, speech and text corpora represent a rich source of data and phenomena, forming the basis of their research. Such data is even more beneficial if it is annotated with suitable information, allowing for fast and effective retrieval of relevant data. Whereas corpora of the first generation featured part-of-speech and syntactic annotations (e.g. Penn Treebank [MSM93], TIGER corpus [BDE04]), the focus has now shifted to properties beyond the (morpho-)syntactic level. Recent corpora are annotated with semantic information (PropBank [KP02], FrameNet [JPB03], SALSA [EKPP03]), pragmatic information (Penn Discourse TreeBank [MPJW04], RST Discourse Treebank [CMO03], Potsdam Commentary Corpus [Ste04]), and dialogue structure (Switchboard SWBD-DAMSL [JSB97]).

Annotations often have to be carried out manually: reliable (semi-)automatic tools exist only for the annotation of part of speech and syntax, and are restricted to well-researched languages like English or German. Moreover, hand-annotated training material is a prerequisite for the development of automatic tools. As a consequence, corpora and annotations ought to be reusable so that a large community can profit from the data. To this end, various standardization efforts have been launched.

Standardization of linguistic data concerns (see, e.g., [Sch05]):
(i) The physical data structure: here, XML has become the widely recognized standard format.
(ii) The logical data structure: i.e., the data models that are used to model the phenomena and their properties (e.g.
hierarchical structures like trees or graphs for syntax annotations vs. time-aligned tiers for speech and dialogue annotations). Examples of data models are annotation graphs [BL01] and the NITE Object Model [CKO03b].
(iii) Content: in several initiatives, XML applications for specific linguistic annotations have been developed. For instance, TEI² (“Text Encoding Initiative”, [SB94]) defines highly detailed DTDs for encoding all kinds of bibliographic and other information; XCES³ (“XML-based Corpus Encoding Standard”) provides DTDs for the annotation of chunks, alignment, etc.

More recently, however, it has been recognized that these standardized DTDs often do not meet application-specific needs. Hence, abstract, generic XML formats have been proposed that allow for the formal integration of application-specific annotations [IR01]. For the conceptual integration of specific annotations, so-called data category repositories as well as linguistic ontologies have been developed. They define reference categories, with precise semantics and examples, that specific annotation tags ought to be mapped to (see, e.g., DOLCE⁴, “Descriptive Ontology for Linguistic and Cognitive Engineering”).

This paper deals with the formal integration of specific annotations. It first addresses the subject of stand-off architecture (sec. 1). We then propose an XML-based representation of linguistic annotation and present an example application (instantiation) in some detail (sec. 2). We also sketch out some application scenarios that profit from such a flexible architecture (sec. 3) and address related approaches (sec. 4).

¹ The research reported in this paper was jointly financed by the German Research Foundation (DFG, SFB632) and the Federal Ministry of Education and Research (BMBF grant no. 03WKH22). Many thanks go to my colleagues, especially Michael Götze, for helpful discussions of the topics addressed in this paper.
1 Stand-off Architecture
The topic of “stand-off annotation” has been discussed since the mid-nineties (see, e.g., [TM97]). This term describes the situation where primary data (e.g., the source text) and annotations of this data are stored in separate files. Stand-off annotation might seem problematic, because there is no immediate connection between the text and its annotation; hence, whenever the source text is modified, extra care has to be taken to synchronize its annotation. Similarly, human inspection of the data becomes cumbersome.

On the other hand, stand-off annotation has the great advantage of leaving the source text untouched. It thus allows for annotating text that cannot be modified for whatever reason, e.g., because it is a text available on the Internet. Moreover, whereas XML as such does not easily account for overlapping segments and conflicting hierarchies,⁵ they can be marked in a natural way in stand-off annotation: by distributing annotations over different files. That is, not only is the source text separated from its annotations, but individual annotations are separated from each other as well. This way, annotations at different levels can be created and modified independently of each other. Finally, even competing, alternative annotations can be represented, e.g. variants of part-of-speech annotations that are the output of different tools.

² http://www.tei-c.org/
³ http://www.cs.vassar.edu/XCES/
⁴ http://www.loa-cnr.it/DOLCE.html
⁵ Different methods have been proposed to accommodate conflicting markup into XML. We will come back to them below.

One of the first proposals for stand-off annotation of linguistic corpora is [DBD98]. An ISO working group is currently developing the stand-off based LAF⁶ (“Linguistic Annotation Framework” [IRdlC03]). Some recent corpora like the ANC (“American National Corpus” [RI04]) are encoded in stand-off architecture.
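As a minimal sketch of the stand-off idea described in this section, the following Python fragment keeps the source text in one place and stores two annotation layers separately as character-offset spans. The layer names, the offsets, and the helpers `resolve` and `overlaps_without_nesting` are illustrative assumptions and are not taken from any of the cited formats. The two spans overlap without nesting, which would be ill-formed as inline XML but causes no conflict here, because each layer only points into the untouched text.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive

# Primary data: no annotation layer ever modifies this string.
TEXT = "Der Rabin-Attentäter Jigal Amir hat"

# Hypothetical layers, each of which could live in its own file.
LAYERS: Dict[str, List[Span]] = {
    "syntax": [(0, 31)],    # assumed syntactic chunk
    "prosody": [(21, 35)],  # assumed prosodic chunk
}


def resolve(span: Span, text: str = TEXT) -> str:
    """Return the stretch of primary text that a span points to."""
    start, end = span
    return text[start:end]


def overlaps_without_nesting(a: Span, b: Span) -> bool:
    """True if the spans share text but neither contains the other."""
    intersect = a[0] < b[1] and b[0] < a[1]
    nested = (a[0] <= b[0] and b[1] <= a[1]) or (b[0] <= a[0] and a[1] <= b[1])
    return intersect and not nested


if __name__ == "__main__":
    for name, spans in LAYERS.items():
        for span in spans:
            print(name, span, repr(resolve(span)))
    print(overlaps_without_nesting(LAYERS["syntax"][0], LAYERS["prosody"][0]))
```

Because the layers are independent, a competing analysis (for example, the output of a second tool) could be added as one more entry without touching the existing ones.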
In the approach presented in this paper, we also subscribe to the principles of stand-off annotation.

2 A Generic XML Format
Our format defines generic XML elements for markables, features, and structures, which indicate which data type the annotation conforms to. We assume that primary data is stored in a file that optionally specifies a header, followed by an element that contains the source text. Annotations are stored in separate files; they may refer to the source text or to other annotations. These relations are encoded by means of XLinks and XPointers. We distinguish three different types of annotations: markables, structures, and features.
(i) Markables: markable tags specify text positions or spans of text (or spans of other markables) that can be annotated with linguistic information. For instance, markable tags might indicate tokens by specifying ranges of the source text, cf. fig. 1.
(ii) Structures: structure tags are special types of markables. Similar to markable tags, they specify objects that can then serve as anchors for annotations. Whereas markable tags define simple types of anchors (flat spans of text or markables), a structure tag represents a complex anchor involving relations between arbitrarily many markables (including other structures). Relations can be further specified by an attribute type, e.g. as undirected or directed (= pointers). Put differently, a structure specifies a complete tree or graph, which consists of single tree fragments specified by the relation tags, cf. fig. 1.
(iii) Features: feature tags specify information annotated to markables or structures, which are referred to by xlink attributes. The type of information (e.g., “part of speech”) is encoded by an attribute type, cf. fig. 2. For instance, the information encoded by the first feature tag in fig. 2 can be paraphrased as follows: Take the token that is defined by the markable tag with the ID attribute id="tok 1" and assign the part of speech “ART” (article) to that token.
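The interplay of markables and features can be sketched in Python as follows. This is an illustration under stated assumptions, not the format itself: the element names `markList`, `mark`, `featList` and `feat`, the attributes `start`, `end` and `value`, and the IDs `tok_1` and `tok_2` are hypothetical, and plain character offsets stand in for the XPointer expressions that the actual format uses. Only the `xlink:href` reference from a feature to a markable follows the description above.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

# Primary data, stored separately from all annotations.
TEXT = "Der Rabin-Attentäter Jigal Amir hat"

# Hypothetical token layer: each markable names a span of the source text.
TOKEN_FILE = """
<markList type="tok">
  <mark id="tok_1" start="0" end="3"/>
  <mark id="tok_2" start="4" end="20"/>
</markList>
"""

# Hypothetical part-of-speech layer: each feature points to a markable.
POS_FILE = """
<featList type="pos" xmlns:xlink="http://www.w3.org/1999/xlink">
  <feat xlink:href="#tok_1" value="ART"/>
  <feat xlink:href="#tok_2" value="NN"/>
</featList>
"""


def load_markables(xml_text):
    """Map markable IDs to (start, end) offsets into the source text."""
    root = ET.fromstring(xml_text.strip())
    return {m.get("id"): (int(m.get("start")), int(m.get("end")))
            for m in root.iter("mark")}


def load_features(xml_text):
    """Map the ID referenced by each xlink:href to the feature value."""
    root = ET.fromstring(xml_text.strip())
    href = f"{{{XLINK}}}href"
    return {f.get(href).lstrip("#"): f.get("value") for f in root.iter("feat")}


def annotate(text, markables, features):
    """Pair each token's text with the value annotated to it, if any."""
    return [(text[start:end], features.get(mark_id))
            for mark_id, (start, end) in markables.items()]


if __name__ == "__main__":
    print(annotate(TEXT, load_markables(TOKEN_FILE), load_features(POS_FILE)))
```

Keeping the two loaders separate mirrors the separation of files: a further feature layer for the same tokens only needs another call to `load_features`.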
We intend to adopt the idea of [CKO03a] by assuming that admissible feature values (such as “NN”, normal/common noun, or “NE”, named entity) may be complex types and are organized in a type hierarchy. For instance, “NN” and “NE” might be subtypes of the more general type “N”, noun. Feature tags then point to some type in the hierarchy (which is stored separately), thus specifying the value of the annotated property, cf. fig. 3.⁷

⁶ ISO Technical TC37/SC4, http://www.tc37sc4.org
⁷ Type hierarchies have to be defined by the user or they may be derived from annotation schemes that incorporate hierarchies, cf. the schemes used by the annotation tool MMAX. In case no hierarchy is defined, the features will be organized in a flat list. The stand-off architecture allows the user to experiment with different hierarchies.

Further examples of annotations are sketched out below. They illustrate that annotations may stem from different sources (see the attribute source) and encode various types of information.

Categorial annotation (anchored to constituents)
...
Coreference annotation, marking coreferential expressions such as pronouns (referred to by xlink:href attributes) and their antecedents (identified by target attributes)
...
Document structure: headers, paragraphs, lists, etc. (anchored to markables that refer to tokens)
... ...
Time alignment: temporal information, specifying starting point and duration (anchored to tokens)⁸
...

⁸ In canonical time-aligned annotation, the single annotations refer to time points and spans. In our example, it is the other way round: time alignment is considered as some sort of annotation. However, our basic units, which are text positions in the examples presented above, may as well consist of points in time rather than points in text.

Annotation set: stand-off files that belong together and form one corpus are marked by dedicated elements. In the example, text, word-level and syntax annotations are grouped by individual elements.
Elements can be used to specify properties of these groups (such as “primary data”, “syntax”). In a similar way, groups of annotation sets can be defined to form (sub)corpora.
...

3 Application Scenarios
As argued above, stand-off representation has many advantages. For further processing, however, such extensive use of XLinks can be problematic for performance. Similarly, our format is certainly not suitable for human inspection and debugging.

Inline Versions
We therefore envisage the following scenario: depending on the current application, an inline version is pre-computed, which consists of only those layers that are highly relevant to the application in question. For instance, token and sentence boundaries, word forms, and part-of-speech annotation offer enough information for many applications (and represent exactly the kind of data which traditional corpora used to comprise). Such a condensed, inline version of our above example is displayed below. An attribute records the layers that the annotations have been taken from: token boundaries and word forms stem from the file with ID rabin1.tok; sentence boundaries are encoded by the file with ID rabin1.const cat; finally, part-of-speech annotation is encoded in rabin1.pos.
Der Rabin-Attentäter Jigal ... ...
For sophisticated applications such as automatic text summarization, more layers are needed and, hence, added to the inline representation, e.g., document structure and sentence relevance,⁹ as in the following example.
Der Rabin-Attentäter Jigal Amir hat ...
A summarizing tool that operates on such input might also profit from the other annotations. For instance, sentences of the input that are recognized by the summarizer as being highly relevant will be included in the summary.
If such a sentence contains a pronoun whose referent (antecedent) has not been extracted, the summarizer would start a fixing procedure: by making use of the stand-off coreference annotation, it would determine the pronoun’s referent and replace the pronoun accordingly. That is, the summarizer takes just as much information into consideration as currently necessary.

Similar use cases are linguistic applications like the investigation of certain phenomena (e.g., information structure). Here, the relevant factors are often not known in advance and differ from phenomenon to phenomenon. Hence, it seems sensible to start with a restricted set of “canonical” information and then include more and more annotations in the investigation. This way, the impact of the individual linguistic features (i.e., annotation types) can be observed more directly and easily than by looking at complex annotations simultaneously.

Bringing Stand-off Annotations Together
Obviously, the more (complex) annotation levels we include in the inline version, the more likely we are to induce conflicting hierarchies. An example of such a conflict involves overlapping syntactic and prosodic chunks (represented in ill-formed XML):
syntactic content ... prosodic/syntactic content ... prosodic content ...

⁹ In the example: relevances computed on the basis of 4-grams, word forms, and Porter stems, respectively. The summarizing tool might, e.g., compute the average value of these relevances, or else make use of the relevance types in different ways during processing.

Different strategies have been proposed to deal with such conflicts (cf. [SB94, ch. 31], [BBG95]):
Milestones: empty elements mark the start and end point of the nesting element that is considered less important.
syntactic content ... prosodic/syntactic content ... prosodic content ...
Fragmentation: the less important nesting element is broken into smaller units.
Virtual joins (in addition to fragmentation): a join tag is added to the fragmentation representation to explicitly mark elements that belong together.
syntactic content ... prosodic/syntactic content ... prosodic content ...
Alternatively, next and prev attributes can be added to the fragments.
syntactic content ... prosodic/syntactic content ... prosodic content ...
Redundant encoding: multiply-annotated text (prosodic/syntactic content) is duplicated, resulting in multiple files (each of which is inline). Obviously, this is not an option for an efficient exploitation of multiple annotations.
syntactic content ... prosodic/syntactic content ... prosodic/syntactic content prosodic content ...

Today, only a few tools support creating inline versions of stand-off annotations, e.g. LT XML¹⁰. However, LT XML does not allow for conflicting hierarchies. [WGSL05] present a Prolog-based tool for merging two conflicting XML hierarchies by replacing one of the annotations by milestones or fragments. They rely on redundant encoding as the input to their tool. In some preliminary experiments, we successfully applied this tool to a Prolog representation of our sample data, which we created with XSL stylesheets.

¹⁰ http://www.ltg.ed.ac.uk/software/xml/index.html

4 Related Approaches
Most recent work in corpus annotation relies on XML, and many projects now make use of stand-off annotations, e.g. ANC [RI04], FrameNet [JPB03], PropBank [KP02]. Most of these projects, however, focus on one or two types of annotation only, such as (morpho-)syntax, or syntax combined with semantics. Semantic annotations like predicate-argument relations typically result in overlapping hierarchies (see, e.g. [KP02], [EKPP03]). Few of the projects address annotations at more than two levels.
One such example is the MULI project, which used multi-level annotations for the investigation of information structure, comprising a syntactic, a discourse and a prosodic level [BBHS04].

Like MULI, we deal with multi-level, heterogeneous annotation. In contrast to MULI, however, we use a generic XML format to represent the data. Such generic formats have been proposed as interchange formats, e.g., in LAF (Linguistic Annotation Framework [IRdlC03]), AIF (ATLAS Interchange Format [LFGP02]) or TIGER/SALSA XML [EP04]. The exact form of LAF is still under discussion (it is on the way to becoming an ISO standard); AIF is available in a beta version.¹¹ TIGER/SALSA XML has been applied successfully in the SALSA project to encode frames (semantic roles) [EKPP03]. Whereas these formats might in principle host heterogeneous annotation, projects dealing with such data (like MULI) tend to develop task-specific formats. In a way, our work presents a “proof of concept” of such generic formats in the domain of multi-level, heterogeneous annotation. Our standard format currently integrates data annotated with part of speech, morphology and lemma, syntax, rhetorical relations, anaphoric relations, and information structure [Ste04]. Some data is also annotated with phonetic/phonological information (breaks, pitch-range, tones, etc.).¹²

5 Conclusion and Outlook
We presented a generic, stand-off XML representation that allows for flexible integration of various kinds of linguistic information. Annotations from different tools and formats can be mapped to our generic standard format. The stand-off architecture supports the representation of conflicting hierarchies and competing annotations. Exploitation of the data can proceed in a similarly flexible way: in the first run, data to be considered is restricted to often-used, canonical information; additional data is only added upon request.
This architecture supports the use and reuse of multiply-annotated data in many different applications, by offering inline versions of the data that are tailored to the application-specific needs.

As one of our next steps, we plan to design a representation of ambiguities and underspecification that fits into our general architecture. A quick solution would be to simply represent all possible interpretations by stand-off files. However, this solution is neither efficiently computable nor does it explicitly represent the actual facts, namely that parts of the data are shared by all files while other parts diverge.

¹¹ http://www.nist.gov/speech/atlas/develop/aif.html
¹² The annotations are created by means of different tools: EXMARaLDA (http://www.rrz.uni-hamburg.de/exmaralda/), annotate (http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/annotate.html), MMAX (http://mmax.eml-research.de/), and RST Tool (http://www.wagsoft.com/RSTTool/). The export format of these tools is mapped to our standard format. For manual inspection of the data at multiple levels, our project has developed the tool ANNIS, which provides viewing and searching facilities [DGSW04] (http://www.sfb632.uni-potsdam.de/annis/).
Similar resources
Multi-dimensional Annotation and Alignment in an English-German Translation Corpus
This paper presents the compilation of the CroCo Corpus, an English-German translation corpus. Corpus design, annotation and alignment are described in detail. In order to guarantee the searchability and exchangeability of the corpus, XML stand-off mark-up is used as representation format for the multi-layer annotation. On this basis it is shown how the corpus can be queried using XQuery. Furth...
MAE and MAI: Lightweight Annotation and Adjudication Tools
MAE and MAI are lightweight annotation and adjudication tools for corpus creation. DTDs are used to define the annotation tags and attributes, including extent tags, link tags, and non-consuming tags. Both programs are written in Java and use a stand-alone SQLite database for storage and retrieval of annotation data. Output is in stand-off XML.
Tools for hierarchical annotation of typed dialogue
We discuss a set of tools for annotating a complex hierarchical and linguistic structure of tutorial dialogue based on the NITE XML Toolkit (NXT) (Carletta et al., 2003). The NXT API supports multi-layered stand-off data annotation and synchronisation with timed and speech data. Using NXT, we built a set of extensible tools for detailed structure annotation of typed tutorial dialogue, collected...
Discontinuous Constituents: a Problematic Case for Parallel Corpora Annotation and Querying
In this paper, we discuss some linguistic phenomena that pose potential problems for multilevel linguistic annotation of parallel corpora in general and specifically for data encoding with state-of-art multilevel corpus querying tools such as CQP. We describe the strategy we use for integrating the standard hierarchical XML representation used to annotate such phenomena in our aligned bilingual...
Layering and Merging Linguistic Annotations
The American National Corpus and its annotations are represented in a stand-off XML format compliant with the specifications of ISO TC37 SC4 WG1’s Linguistic Annotation Framework. Because few systems that enable search and access of the corpus currently support stand-off markup, the project has developed a SAX like parser that generates ANC data with annotations in-line, in a variety of output ...
Publication date: 2005